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Abstract 

Human mortality datasets can be expressed as multiway data arrays, the dimensions of 
which correspond to categories by which mortality rates are reported, such as age, sex, country 
and year. Regression models for such data typically assume an independent error distribution, 
or an error model that allows for dependence along at most one or two dimensions of the 
data array. However, failing to account for other dependencies can lead to inefficient estimates 
of regression parameters, inaccurate standard errors and poor predictions. An alternative to 
assuming independent errors is to allow for dependence along each dimension of the array using a 
separable covariance model. However, the number of parameters in this model increases rapidly 
with the dimensions of the array, and for many arrays, maximum likelihood estimates of the 
covariance parameters do not exist. In this paper, we propose a submodel of the separable 
covariance model that estimates the covariance matrix for each dimension as having factor 
analytic structure. This model can be viewed as an extension of factor analysis to array-valued 
data, as it uses a factor model to estimate the covariance along each dimension of the array. We 
discuss properties of this model as they relate to ordinary factor analysis, describe maximum 
likelihood and Bayesian estimation methods, and provide a likelihood ratio testing procedure 
for selecting the factor model ranks. We apply this methodology to the analysis of data from 
the Human Mortality Database, and show in a cross-validation experiment how it outperforms 
simpler methods. Additionally, we use this model to impute mortality rates for countries that 
have no mortality data for several years. Unlike other approaches, our methodology is able to 
estimate similarities between the mortality rates of countries, time periods, and sexes and use 
this information to assist with the imputations. 
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1 Introduction 



Human mortality data are used extensively by researchers and policy makers to analyze historic 
and current population trends and assess long-term impacts of public policy initiatives. To enable 
such inference, numerous regression models have been proposed that estimate mortality rates as 



a function of age using a small number of parameters (Heligman and Pollard (1980), Mode and 



Busby (1982), Siler (1983)). Practitioners using these methods typically model the age-specific 



death rates for each country, year, and sex combination separately and assume independent error 
distributions. Examples of death rates analyzed by such methods are shown in Figure [T] for the 
United States and Sweden. Each mortality curve is defined by 23 age-specific death rates and the 
average sex-specific mortality curve from 1960-1980 over thirty-eight countries is also displayed. 

From the figure, it is clear that a country's mortality rates in one time period are similar to 
its rates in adjacent time periods. Acknowledging this fact, several researchers have developed 
models for "dynamic life tables" , i.e. matrices of mortality rates for combinations of ages and time 
periods, for single country-sex combinations. An example of such a life table is the male death 
rates in Sweden from 1960 to 1980 shown in Figure [1} Some of the models developed for these data 



specify ARIMA processes for the time- varying model parameters (McNown and Rogers (1989), 



Renshaw and Haberman (2003b)), while others smooth the death rates over age and time using 



a kernel smoother (Felipe et al. ( |2001 )), p-splines (Currie et al. (2004)), nonseparable age-time 
period covariance functions (Martinez-Ruiz et al. ( 2010[ )), or multiplicative effects for age and time 



(Lee and Carter (1992), Renshaw et al. (1996), Renshaw and Haberman (2003a), Renshaw and 



Haberman (2003c), Chiou and Miiller (2009)). 



Human mortality datasets typically provide mortality rates of populations corresponding to 
combinations of several factors. For example, the Human Mortality Database (HMD, Human 
Mortality Database, 2011) provides mortality rates of populations corresponding to combinations 
of 40 countries, 9 time periods, 23 age groups, and both male and female sexes. As is shown in 
Figure [TJ mortality rates of men and women within a country will typically both be higher than or 
both lower than the sex-specific rates averaged across countries. Furthermore, differences between 
male and female mortality rates generally show trends that are consistent across countries and time 
periods. Such patterns suggest joint estimation of mortality rates using a model that can share 
information across levels of two or more factors. Two models that consider death rates for more 
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Figure 1: Mortality curves for the United States of America and Sweden. The gradient of colors 
for each country represents the log death rates in the four 5-year time periods from 1960 to 1980. 
The average sex-specific mortality curve over the four time periods and all countries is shown in 
black. 



than one country or sex are that developed by Li and Lee ( 2005 ) , which estimates common age and 



time period effects for a group of countries or both sexes, and Carter and Lee (1992), where male 
and female death rates within the same country share a time-varying mortality level. Although 
these methods consider either both sexes or multiple countries, the extreme similarity of the curves 
in Figure 1 for males across countries and for a given country across sexes suggest that separately 
modeling death rates for different countries or sexes is inefficient, and inference may be improved 
by using a joint model that shares information across all factors. 

With this in mind, we consider a regression model for the HMD data consisting of a mean model 
that is a piecewise-polynomial in age with additive effects for country, time period, and sex (more 
details on this model, and its comparison to other models, are provided in Section 4). This mean 
model is extremely flexible: It contains over 370 parameters and an ordinary least squares (OLS) fit 
accounts for over 99% of the total variation in the data (coefficient of determination, R 2 > 0.995). 
Nonetheless, an analysis of the residuals from the OLS fit indicates that some clear patterns in the 
data are not captured by the regression model, and in particular, a model of independent errors is 
a poor representation of these data. To illustrate this, note that the residuals can be represented 
as a 4-way array, the dimensions of which are given by the number of levels of each of the four 
factors: country, time period, sex and age. To examine residual correlation across levels of a factor, 
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the 4-way array of residuals can be converted into a matrix whose columns represent the levels of 
the factor, and a sample correlation matrix for the factor can be obtained. Figure [2] summarizes 
the patterns in the residual correlations using the first two principal components of each sample 
correlation matrix. If a model of independent errors were to be adequate, we would expect the 
sample correlation values to be small and centered about zero, and no discernible patterns to exist 
in the principal components. However, the sample correlations are substantially more positive than 
would be expected under independence: 59% of the observed country correlations, 61% of time 
period correlations and 98% of age correlations are greater than the corresponding 95% theoretical 
percentiles under the independence assumption. Additionally, there are clear geographic, temporal, 
and age trends in the principal components in Figure[2j For example, the residuals for the Ukrainian 
mortality rates are positively correlated to those for Russia, and the residuals for the year 2000 
are positively correlated with those for 1995. This residual analysis suggests that an assumption 
of uncorrelated errors is inappropriate on account of the positive correlation exhibited by residuals 
across levels of the factors. 
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Figure 2: The first two principal components of each sample correlation matrix are displayed, and 
countries in the same United Nations region are shown in the same color. Close proximity in the 
principal components space away from the origin is indicative of a positive correlation. 

Failure to recognize correlated errors can lead to a variety of inferential problems, such as 
inefficient parameter estimates and inaccurate standard errors. For the analysis of the mortality 
data, an additional important consequence is that the accuracy of predictions of missing mortality 
rates may suffer. Predicting missing death rates is a primary application of modeling mortality 
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data, as developing countries often lack reliable death registration data. It is possible that the 
residual dependence could be reduced by increasing the flexibility of the mean model, but since 
this is already fairly complex, we may instead prefer to represent residual dependence with a 
covariance model, leading to a general linear model for the data in which the mean function and 
residual covariance are estimated simultaneously. 

The mortality data, like the residuals, can be represented as a 4-way array, each dimension of 
which corresponds to one of the factors of country, time period, sex and age. In the literature 



on multiway array data (see, for example, Kroonenberg (2008)), each dimension is referred to as 
a mode of the array, so a 4-way array of mortality data consists of four modes. As described by 



Hoff (2011), a natural covariance model for a if -way data array is a separable covariance model, 



parameterized in terms of K covariance matrices, one for each mode of the array. If the array is 
also assumed to be normally distributed, the model is referred to as the array normal model, and 
can be seen as an extension of the matrix normal model (Dawid ( 1981[ )). 

Even though the separable covariance model is not a full, unstructured covariance model, the 
array normal likelihood is unbounded for many array dimensions, prohibiting the use of maximum 



likelihood methods (Manceur and Dutilleul (2013)). Estimates of the array normal covariance pa- 



rameters can still be obtained by taking a Bayesian approach (Hoff (2011)) or by using a penalized 



likelihood (Allen and Tibshirani (2010)). However, the lack of existence of the maximum likeli- 
hood estimates (MLEs) indicates that the data is unable to provide information about all of the 
parameters. In this article we propose an alternative modeling approach that parameterizes the 
covariance matrix of each mode as a reduced rank matrix plus a diagonal matrix. This covariance 
structure is commonly associated with factor analysis, and is referred to here as factor analytic 
covariance structure. We call this new model the Separable Factor Analysis (SFA) model, as it is 
an extension of factor analysis to array-valued data. The reduction in the number of parameters 
by using covariance matrices with factor analytic structure leads to existence of MLEs for the SFA 
parameters in many cases when the MLEs of the array normal parameters do not exist, as well as 
a parsimonious representation of mode-specific covariance in an array- valued dataset. 

This article is outlined as follows: In the next section we introduce and motivate SFA, as well 
as discuss its properties and similarities to ordinary factor analysis. We describe two estimation 
procedures in Section [3j an iterative maximum likelihood algorithm and a Metropolis-Hastings 
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sampler for inference in a Bayesian framework. A likelihood ratio testing procedure for selecting 
the rank of the factor model for each mode is also presented. In Section [4] the SFA model is used to 
analyze the HMD mortality data and its performance is compared to simpler covariance models in 
a simulation study. We illustrate how SFA uses estimated similarities between country mortality 
rates to provide imputations for countries missing mortality data for several years. This prediction 



method extends the approach taken in Coale and Demeny (1966), Brass (1971), United Nations 



(1982), and Murray et al. (2003), where one country's mortality curve is modeled a function of 



another's. Our approach is novel in that it estimates the covariance between mortality rates across 
all countries, time periods, and sexes, and uses these relationships to impute missing death rates. 
We conclude with a discussion in Section [U 



2 Extending factor analysis to arrays 
2.1 Motivating separable factor analysis 

Suppose Y is a K-way array of dimension m\ x mi x ... x mj(. We are interested in relating 
the data Y to explanatory variables X through the model Y = M(X, (3) + E, where f3 represents 
unknown regression coefficients and E represents the deviations from the mean. As was discussed 
in the preliminary analysis of the mortality data in Section [TJ it is often unreasonable to assume 
the elements of E are independent and identically distributed. 

In cases where there is no independent replication, estimation of the Cov[22] can be problematic 
as it must be based on essentially a single sample. One solution is to approximate the covariance 
matrix with one with simplified structure. A frequently used model in spatio-temporal analysis is 



a separable covariance model (Stein (2005), Genton (2007)). This model estimates a covariance 
matrix for each mode of the array and is written Cov[vec(i£)] = Sjf ® Tk-i ® ■•• <S> T\, where 
"vec" and "(8)" denote the vectorization and Kronecker operators, respectively. In the context of 
the mortality data, this model contains a covariance matrix for country (T c ), time period (Tit), 
age (T a ), and sex (T s ). A separable covariance model with the assumption that the deviations 
are normally distributed, vec(E) ~ normal(0, Cov[vec(i?)]), is an array normal model and was 



developed by Hoff (2011) as an extension of the matrix normal (Dawid (1981), Browne (1984), 
OOTt| ( p99| ». 
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The mode covariance matrices in the array normal model are not estimable for certain array 



dimensions using standard techniques such as maximum likelihood estimation (Manceur and Du 



tilleul (2013)). However, often the covariance matrices of large modes can be well approximated by 
matrices with simpler structure. A common approach in the social sciences to modeling the covari- 
ance matrix of a high-dimensional random vector is to use factor analysis. The standard /c-factor 
model for a random vector x G MP parameterizes the covariance matrix as Cov[x] = AA T + D 2 , 



where A G M pxfe , k < p, and I? is a diagonal matrix (Spearman (1904), Mardia et al. (1979)). We 
will refer to this model as single mode factor analysis as it models the covariance among one set of 
variables. When the number of independent observations n is less than p, the sample covariance 
matrix is not positive definite and hence cannot be used as an estimate of Cov[x]. Nevertheless, 
under the assumption that x follows a multivariate normal distribution with known mean, the max- 



imum likelihood estimate of the factor analytic covariance matrix exists if k < min(p, n) ([Robertson 



and Symonsl fl2007| )). 

We propose a submodel of the array normal model where each mode covariance matrix poten- 
tially has factor analytic structure. We call this model Separable Factor Analysis (SFA) and it is 
written as follows: 

vec(-E) ~ normal(0, Cov[vec(£')]) 
Cov[vec(£)] = Sjj- ® Ejf-i ® ». ® Si, (1) 

where Sj = A{Aj + D 2 for < ki < 

and £j is unconstrained (i.e. equals any positive definite matrix) if ki = rrtj. SFA models are 
characterized by the covariance matrix structure chosen for each mode and can be represented by 
a i^-vector of ranks (ki, ...,kx), where ki equals the rank of Aj if mode i's covariance matrix has 
factor analytic structure and equals rrtj if the mode covariance matrix is unstructured. Note that 
we consider the ki = case where the covariance matrix is diagonal, reflecting independence of 
entries along the mode. A fcj-factor analytic covariance matrix, A{Aj + Df, contains S(rrii,ki) = 
[(mi — ki) 2 — (rrii + fcj)] /2 fewer parameters than an unstructured rrti x m, covariance matrix. If 
5(nii,ki) < 0, the factor analytic covariance matrix does not provide reduced structure, and to 
prevent overparameterizing the model, either the factor analytic rank, ki, should be decreased or 
an unstructured covariance matrix should be specified. 
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SFA has advantages over the array normal model that stem from it being an extension of factor 
analysis to multiple modes. Each mode covariance matrix can be interpreted independently as a 
decomposition of the mode variability into shared and unique latent components as in classical 
factor analysis (see Properties below). In addition, empirical evidence has shown that MLEs of 
the SFA covariance parameters exist for array dimensions where the MLEs of the array normal 
unstructured covariance matrices do not exist. 



2.2 Properties of SFA 

In this section we relate the SFA parameters to those in ordinary factor analysis, discuss indeter- 
minancies in the model, and interpret the SFA parameters when the true covariance matrix in each 
mode is unstructured. We use the concept of array matricization to describe how these properties 
relate to each mode. Here we define matricizing an array in the ith mode as unfolding the array 
into a matrix of dimension (jm x Y\j^=i m j)i w h ere an earlier mode index always varies faster 
than a later mode index in the columns (Kolda and Bader ( 2009| )). There are multiple ways to 
matricize an array but here we follow this convention (Kiers (2000), De Lathauwer et al. (2000|)). 



Latent variable representation 

A single mode fc-factor model for a sample of n mean zero p-variate random vectors is written 
{x\, ...,x n } ~ i.i.d. normal(0, AA T + D 2 ), where A G M. pxk and D is a diagonal matrix. Collecting 
the random vectors in a p x n matrix X = [x\, x n ], this model has an equivalent latent variable 
representation as a decomposition into common latent factors, Z = [z\, z n ], and variable specific 
latent factors, E = [e 1; ...,e n ], as follows. 

Xpxn = A-pxh^kxri DpxpEpxn 

{zi,...,z n } ~ i.i.d. normal(0,I fc ) Cov[z{, e,] = fcxp for alU, j (2) 

{ei, e n } ~ i.i.d. normal(0,I p ) 

In this representation the jth observation of the ith variable Xij is written as a linear combination 
of common latent factors z% with coefficients given by the ith row of A, plus a single variable specific 
factor Eij, whose coefficient is the ith diagonal element of D. 

A similar representation exists for each mode with a factor analytic covariance structure in the 
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SFA model. Consider a mean zero array Y and an SFA model with a factor analytic covariance 
matrix in the ith mode. Define Y % to be the array obtained by standardizing Y with all but the 
ith mode's covariance matrix: 

vec(y i ) := vec(Y)(£ x 1/2 <g> ... <g> E^ 1 / 2 ® l mi ® S^ 1 / 2 ® ... ® S^ 1/2 ). (3) 

It follows that 

Yfo = [yi,...,y m _J =A^ + A^ and {yi, y m _J ~ i.i.d. normal(0, AjAf + _D, 2 ) (4) 

where rri-i = Ylj^THj, and Z J and E l are fcj x m_j and x m_j, respectively, with the same 
distributional properties as Z and E in Q. The superscript i on Yh. indicates the zth mode has 
not been standardized and the subscript (i) indicates the array has been matricized along the ith 
mode. The representation in Q suggests the parameters {Aj,Dj} can be viewed as single mode 
factor analysis parameters for the ith mode of the array when the covariance in all other modes 
has been removed. 

Model indeterminacies 

SFA as parameterized in ([T]) has two indeterminacies, one of which is common to all factor models 
and one that is common to all array normal models. The first indeterminacy, which is also present 
in single mode factor analysis, is the orientation of the A matrices. The array covariance matrix in 
([!]) is the same with mode i factor analytic parameters {Aj, D{\ as it is with parameters {AjGj, Di}, 
where Gi is any ki x ki orthogonal matrix. The second indeterminacy concerns the scales of the 
mode covariance matrices and stems from the model's separable covariance structure. The scale 
of a mode's covariance matrix can be moved to another mode covariance matrix or split among 
multiple modes' covariance matrices without changing the model. For example, the transformation 
i — y {cSi,Sj/c} does not affect the array covariance matrix for any c > 0. This scale 
non-identifability is eliminated if all mode covariance matrices are restricted to have trace equal to 
one and a scale parameter is included for the total variance of the array. 

Pseudo-true parameters 

In single mode factor analysis the goal is to represent the covariance among a large set of variables 
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in terms of a small number of latent factors. However, often it is unlikely the true covariance matrix 
has factor analytic structure. Consider a p x n matrix X = [x±, x n ] and suppose {x\, ...,x n } ~ 
i.i.d. normal(0,X). There is interest in what fc-factor analytic parameter values, A and D, best 
approximate the true covariance matrix X. These optimal parameter values, denoted A(S) and 
-D(X), are those that minimize the Kullback-Leibler (KL) divergence between the factor model 
and the multivariate normal model. Minimizing the KL divergence is equivalent to maximizing the 
expected value of the A;-factor analysis (FA) probability density with respect to the true multivariate 
normal (MN) distribution. Thus, A(S) and D(T,) can be defined as 

{A(E),S(S)} := argmaxE M N[PFA(^|A, J D)] = argmax c FA -"log(|AA T + J D 2 |)-^tr[(AA T +D 2 )- 1 S] 

A,D A,D * * 

where "tr" represents the trace operator and cfa is a constant not depending on A or D. The 
diagonal matrix D that best approximates the true covariance matrix in the case of k = is given 
by -D(S) = diag(E) 1 / 2 , where "diag" is the operator that replaces all off-diagonal entries with zero. 

Similar to single mode factor analysis, SFA is an approximation to a separable covariance struc- 
ture and modes' true covariance matrices are unlikely to have factor analytic structure. Suppose the 
distribution of Y is array normal with mean zero and covariance matrices S = {Sj:l<i< K}. 
Consider a (ki, ...,kjc) SFA model for Y with parameters A = {Aj : < k{ < m;}, D = {Di : 
< ki < rrii}, and XI = {T,j : kj = mj}. The expected value of the SFA probability density with 
respect to the true array normal (AN) model is 

K K 

E A n[psfa(^|S, D, A)] = csfa - T^kgflEil) " o II ^Pi" 1 ^], (5) 

i=l m% i=l 

where Sj = AjA^ + Df for < fcj < rrii, 

csfa is a constant independent of the SFA parameters, and m = YliLi m i- Let A(S), D(E), and 
S(S) denote the SFA parameters that maximize ^ and, hence, provide the best approximation 
to the true separable covariance matrix. It can be shown that for all appropriate i, j, and k 

A i (E) = A(S i ), S i (S)=S(E i ), and !! fc (S) = E fc . (6) 

This means that the best factor analytic parameters for a given mode in the SFA model are 
the closest fitting single mode factor analytic parameters for that mode's true covariance matrix. 
Furthermore, as we might expect, the optimal values of the unstructured covariance matrices in the 
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SFA model are the modes' true covariance matrices. This implies that when the true model is array 
normal, the optimal SFA parameters for a given mode do not depend on the specified covariance 
structures in the other modes. Note that the scale indeterminacy of the covariance matrices is 
still present here. Thus, there is a set of optimal SFA parameter values that provide the same 
approximation and can be derived from one another by reallocating the covariance matrices' scales. 
Asymptotically, as the number of replicates of the array increases, these optimal SFA parameter 
values are the limiting values of the SFA maximum likelihood estimates QWhite| ( fl982| )). 



3 Estimation and testing 

In this section we consider parameter estimation for the SFA model and propose a likelihood ratio 
testing procedure for selecting the ranks (k\, kx)- Two estimation methods are described here: 
an iterative algorithm for maximum likelihood estimation and a Metropolis-Hastings algorithm 
which approximates the posterior distribution of the parameters given the data. For notational 
convenience we present the case where the array has mean zero, however both estimation methods 
and the testing procedure can be extended to allow for a mean structure. Examples of such 
extensions are discussed in Section [4] for the mortality data. 

3.1 Maximum likelihood estimation 

Simultaneous maximization of the SFA log likelihood with respect to all parameters is difficult. 
However, the maximization steps are manageable if done separately for each mode's covariance 
parameters. We propose an iterative algorithm that at each step maximizes the SFA log likelihood 
over a single mode's covariance parameters using the latest values of all other modes' parameters. 
This algorithm can be viewed as a form of block coordinate ascent and is guaranteed to increase 
the log likelihood at each step. 

Let A = {Aj : < ki < rrii}, D = {Di : < ki < rrii}, and 5] = {Ej : kj = rrij} as in Section 
Also let A_j = A/{Aj} be the set A with Aj removed, and define D—j and analogously. 



2.2 



The iterative maximum likelihood algorithm proceeds as follows. 

0. Specify initial values for all covariance parameters {A, D, £}. 

1. For each mode {i : ki = 0}, update the estimate of D{. 
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2. For each mode {i : < ki < wii}, update the estimates of Aj and -Dj. 

3. For each mode {i : fcj = mj, update the estimate of £j. 

4. Repeat steps 1-3 until a desired level of convergence is obtained. 

The maximization in steps 1 and 3 over a diagonal matrix and an unstructured covariance 
matrix, respectively, are straightforward. The SFA log likelihood as a function of -Dj, treating all 
other parameters as fixed, is written 

«(A|E, A,D^,Y) = b l - £^log(\Df\) - ^[YfoiYfof Df) (7) 

where hi is a constant independent of -Dj and m = ]X m,. The maximizing value of -Dj and update 
for step 1 is thus 

D ? = dia g (^-,(^F) 

where the covariance matrices used to standardize Y in Y l are the latest covariance matrix esti- 
mates. The SFA log likelihood as a function of an unstructured covariance matrix Sj is given by ([7]) 
where D 2 is replaced by £j. Hence, the value of £j that maximizes the corresponding log likelihood 
and the update for step 3 is 

Estimation of a mode's factor analytic parameters in step 2 is more difficult, but can be ac- 
complished using methods developed for single mode factor analysis. The SFA log likelihood as 
a function of the ith mode's factor analytic parameters, treating all other modes' parameters as 
fixed, is 

l(A t , A|S, A_i, D-i,Y) =a- £-log(|A,Af + D}\) - ^tr[(A,Af + D^Y^f] (8) 

where c% is a constant not depending on Aj or Di. The log likelihood for a single mode /cj-factor 
model for a p x n matrix X is written 

£(A, D\X) = c - ^log(|AA T + D 2 \) - ^tr[(AA T + D 2 )~ 1 XX T \. (9) 

Notice that the SFA log likelihood has the the same form as that for single mode factor analysis 
where XX T and n are replaced by Y^(Y^) T and rn/rrii, respectively. Therefore, we can use 
existing estimation methods for single mode factor analysis to update Aj and Dj. 
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Numerous iterative algorithms have been developed to obtain the single mode factor model max- 
imum likelihood estimates; however many suffer from poor convergence behavior ( |Lawley (1940), 



Joreskog (1967), Jennrich and Bobinson (1969)). An expectation-maximization (EM) algorithm 
was developed based on the model representation in ([2]) that treats Z as latent variables (Demp- 



ster et al. (1977), Rubin and Thayer (1982)). The slow convergence of this algorithm led to 
expectation/conditional maximization either (ECME) algorithms, some of which rely on numerical 



optimization procedures (Liu and Rubin (1998), Zhao et al. (2008)). Zhao et al. (2008) proposed 



an iterative algorithm that updates A, treating D as known, and then sequentially updates each 
diagonal element of D, treating A and all other elements of D as known. This algorithm has closed 
form expressions for all parameter updates and was shown to outperform the EM algorithm and 
its extensions in terms of convergence and computation time. For these reasons, we chose to use it 
for step 2 of the SFA estimation procedure. 

Divergence of the SFA maximum likelihood algorithm, where the log likelihood continually 
grows at a nondecreasing rate, is evidence that the maximum likelihood estimates do not exist. 
While the update in step 1 for a mode with a diagonal covariance matrix is alway well defined 
(i.e. the SFA log likelihood in ([7]) has a maximum in terms of Di), step 2 of the algorithm for 
an unstructured covariance matrix is only well defined if mj < Ylj^i m j ■ Similarly, step 3 is well 
defined for a mode i with a factor analytic covariance matrix if k{ < rank(Y^(Y^) T ). This latter 
requirement is effectively equivalent to ki < min(mj, Ylj^i m j) since Yfa is unlikely to be rank 
deficient for a continuous array Y. 



3.2 Bayesian estimation 

Mortality information is limited for many undeveloped countries that do not have reliable death 
registration data. Thus, it is not uncommon to be missing a country's death rates for specific ages 
or at all ages in a given year. A Bayesian approach to parameter estimation can easily accommodate 
missing data and provide predictive distributions of the missing values. Let Y Q denote the portions 
of the array Y that are observed and Y m represent those values that are missing. Inference for the 
parameters and the missing data can be based on the joint posterior distribution of the parameters 
and missing data given the observed data, p(A, D, X, Y m | Y G ). This posterior distribution is written 
p(K,D,Y,,Y m \Y ) ocp(T|A,.D,£)p(A,.D,£), where p(Y\A, D, E) is the density of the (kx,...,k K ) 
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SFA model for Y = {Y ,Y m } and p(A, _D,S) is the joint prior distribution of the parameters. In 
the case of no missing data, one can consider Y m to be the empty set and Y Q to equal Y. 



Prior specification 

In the absence of real prior information, we suggest a convenience prior composed of semiconjugate 
distributions for the parameters, which also reflects some of the indeterminacies in the model. For 
an SFA model in which mode i's covariance matrix is unstructured, the prior distribution for S^ 1 
is Wishart(Kj, I m J with hyperparameter m, where Kj > m;. For a mode i with a factor analytic 
covariance matrix, the joint prior distribution of {Aj, Di} is specified by the marginal distribution 
of Di and the conditional distribution of Aj given Di, as follows: 

{vec(A i )|A} ~ normal(0,I fc8 ® D 2 ) (10) 
{D^ 2 [l,l],...,Di 2 [mi,mi\} ~ i.i.d. gamma(f /2, rate = Mo/ 2 ) > 0, > 0. (11) 

A priori each mode's parameters are modeled as independent of all other modes' parameters given 
the hyperparameters uq, cZq, and {k, : ki = nrii}. Thus, the joint prior distribution p(A, D, S) is 
equal to the product of the marginal distributions of each mode's parameters. 



The prior distribution of the factor analytic parameters given in ( 10 ) and ( 11 ) has nice properties 
related to the rotational indeterminacies in the A matrices. Recall that the SFA likelihood is 
invariant to rotation of A«, meaning Lsfa(Aj, Di, S, A_j, D—i\Y) = 

LsFAl-A-jGj, Di, S, A_j, D^i\Y) where LgpA is the SFA likelihood and Gj is any ki x ki orthogonal 
matrix. If the joint prior distribution p(Ai, Di) is integrated over the diagonal elements of Di, the 
marginal distribution of Aj is obtained and can be expressed as 



p(Ai) oc Yl [u Q dl + \\Ai\j, 



,21 (ki+v )/2 



where || • || 2 denotes the Frobenius norm. Observe that p(Aj) = p(AjGj) implying that the prior 
distribution is also invariant to rotations of Aj. This is a desirable property as it indicates the prior 
does not favor one set of parameters over another if they are equivalent given the data (i.e. have 
the same SFA likelihood). Namely, all parameter values {AjGj : GjGi = GiGj = 1} are given 
equal probability in the prior. 
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Metropolis-Hastings algorithm 

The posterior distribution p(A, D, XI, Ym| Yo) is not a standard distribution and is difficult to sample 
from directly, so a Metropolis-Hastings algorithm is used to obtain a Monte Carlo approximation of 
it. The Metropolis-Hastings algorithm produces a Markov chain of {A, D, X, Y m }, whose stationary 
distribution is equal to p(A, D,Yl,Y m \Y ). The algorithm proceeds by iteratively proposing new 
values of the missing data and each mode's parameters, and accepting or rejecting the proposals 
based on an acceptance probability. The algorithm can be described as follows: 

0. Specify initial values for all covariance parameters {A, D, £} and missing data Y^. 

1. For each mode {i : ki = 0}, update D^. 

2. For each mode {% : < ki < rrii}, update Aj and Di. 

3. For each mode {i : ki = m;}, update £j. 

4. If elements of Y are missing, update Y m . 

5. Repeat steps 1-4 until a sufficiently accurate approximation of the posterior distribution is 
obtained. 

The update of Di in step 1 is straightforward since the full conditional distribution of the entires 
in D^ 2 given the data and all other parameters is a product of the following independent gamma 
distributions: 

{Dr 2 \j,j]\D- i: A, S, Y} ~ gamma ((u + m/mO/2, rate = (u d 2 + Si[j,j]) /2) (12) 

for j G {1, ...,rrii} where Si = YL(Y^) T . The Metropolis-Hastings acceptance probability is equal 
to one when sampling from parameters' full conditional distribution. Thus, the update in step 1 is 



performed by setting the new value of Di equal to a sample from (12), where Y is comprised of Y a 
and the most current update of Y m , and the covariance matrices used to standardize Y in Y % are 
the most current parameter updates. 

The updates for the factor analytic parameters {Aj, Di} in step 2 are based on the latent variable 
representation of SFA introduced in Q, which is that = AiZ 1 + DiE 1 where the elements of 
Z % and E l are independent standard normal random variables. Conditioning on Z l , the mode i 
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fcj-factor model can be written 

{vec(Y^)\Z i , Ai, Di} ~ normal(vec(A i Z i ), I m/m . ® D 2 ). 
Using this representation of the model, we developed the following Metropolis-Hastings updates: 

A. Update Aj as follows. 

(i) Sample {vec(Z i )|A i , D h Y 1 } ~ normal(vec(0Af DT 2 Yfo) , l m/nii ® 0) 
where <j) = {AjD~ 2 A l + I)" 1 . 

(ii) Sample 

{vecCAiJlZ*,^ A A-*,S} ~ normal( 7 (4 ) » J Dr 2 )vec(y ( J i) ),7 = [(^(^f+W®^ 2 ]- 1 ). 

B. Update Dj as follows. 

(i) Sample {vec(Z*)|Aj, D h Y { } as in A(i). 

(ii) Sample the elements of D 2 independently from 

{D~ 2 \j, j] \Z\Y, D-i, A, £} ~ gamma ((i/ + m/m, + fcj)/2,rate = (^ c?§ + J\j,j) + ||Ai[i,]|| 2 ) /2) 
where J = (Y^ — AjZ*)(Y^ — AiZ l ) T and || • || 2 denotes the Frobenius norm. 

These updates for Aj and D{ are Metropolis-Hastings proposals with acceptance probabilities equal 
to one (see Appendix [A]) . 

Similar to step 1, the update of Sj in step 3 can be performed by sampling from the full 
conditional distribution of S^ 1 given the data and all other parameters: 

{T,T l \D, A, £_;, Y} ~ Wishartfo + m/m h (I m . + Y^(Y^) T ) _1 ). (13) 

This proposal also has acceptance probability equal to one, and as in step 1 and 2, the latest 
updates of all other modes' parameters and the missing data are used in the calculation of Y\ 

Step 4 of the algorithm that updates the missing data can be done in one update or as a sequence 
of updates for any partition of Y m . All elements of Y m can be updated together by sampling from the 
conditional distribution of vec(Yn) given the covariance parameters and observed data. Although 
the conditional distribution is normal, such an update can be expensive due to the large matrices 
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often involved in computing the distribution's covariance matrix. Updating the missing data in 
partitions of the array known as slices, where one mode index is fixed, avoids the need to work 
with such large matrices. For the mortality data, examples of slices include the data for country c 



for all ages, sexes, and years or the data at age a for all sexes, years, and countries. Hoff (2011) 
shows that for a separable covariance structure the conditional distribution for a slice of an array 
given the rest of the array can be written as array normal distribution. If the missing data in the 
slice is then conditioned on the observed data in the slice, a multivariate normal distribution is 
obtained that can be used to update the missing data. Calculating the conditional distribution of 
the missing elements in a slice of the array via this two-step conditioning procedure (once for the 
slice and once for the missing data within the slice) circumvents computation with unnecessarily 
large matrices. As in the update of Sj, the normal distributions mentioned here represent the full 
conditional distributions of Y m and a subset of Y m so the acceptance probabilities equal one. Note 
that although updating subsets of Y m may be easier computationally, it has the potential to make 
the Markov chain less efficient and increase the number of samples needed to obtain an accurate 
approximation of the posterior distribution. 

Unlike in the frequentist setting where divergence of the maximum likelihood estimation pro- 
cedure indicates a lack of information in the data about the parameters, the posterior distribution 
of the parameters given the data will always exist. Although Bayesian parameter estimates are 
available, we should be aware of what information the estimates reflect. The posterior distribution 
of the parameters given the data is combination of the information in the prior distribution and 
the information in the data. Extreme similarity between the prior distribution and the posterior 
distribution suggests that little information is gained from the data and inference based on the 
posterior distribution is primarily a reflection of the information in the prior. 



Hyperparameters 

When there is little prior information about the parameters, it is common to choose hyperparameter 
values that result in diffuse prior distributions. We propose vq = 3 and Ki = rrii + 2 for {i : ki = mj} 
as default values for the SFA model. These values ensure that the first moments of the prior 
distributions are finite and represent some of the most diffuse distributions in the Wishart and 
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gamma families, respectively. They also have the following properties. 

E[Ei]=l mi E[Df\j,j]] = 3d 2 E[tr(A*Af )] = 3fcroidg (14) 

Prior information about specific mode covariance matrices may be limited, however an esti- 
mate ip of the total variance of the array, ip = tr(Cov[vec(y)]) = n£i^ r (Si), may be available. 
This information can improve parameter estimation by centering the prior distribution of the to- 



tal variance of the array around a reasonable value. Based on the expectations in (14) and the 
independence of the mode covariance matrices in the prior, the prior expected value of the total 
variance of the array will equal the estimate, E[tr(Cov[vec(y)])] = ip, if 

-l/R 

d 2 = fr/ R 



II [k j + l])(flm^3 R 

x jm- / V;=i / 



(15) 



where R = Y,f=i 1{0 < h < rrii} is the number of modes with factor analytic covariance structure. 
In the event there is no prior knowledge about ip and it is not of interest in the analysis, we propose 
taking an empirical Bayes approach and obtaining an estimate of it based on the data. Possible 
estimates include ip = ^\\Y \\ 2 or ip = ^p\\Y — M (X,f3)\\ 2 if the model has a non-zero mean, 
where m denotes the number of observed entries in Y. In the latter case, M (X,/3) represents 
an initial estimate of the mean for the observed data, such as the ordinary least squares estimate. 
Specification of cIq, Uq, and {k^ : h- L = rrii} as described here weakly centers the prior distribution 
for the total variation in Y around the estimate ip. 

3.3 Testing for the mode ranks 

It is often difficult to choose the number of factors for a single mode factor model. This problem is 
only more pronounced in the array case where the rank ki must be specified for each mode. As is 



done in single mode factor analysis (Mardia et al. (1979)), a likelihood ratio test can be constructed 



to test between nested SFA models with ranks (ki, kx) and (k*, k* K ), where ki < k* for all i. 
However, due to the large number of possible combinations of ranks, choosing the ranks using these 
likelihood ratio tests is challenging. Here we propose an alternative mode-by-mode rank selection 
procedure that suggests when the rank specified for a given mode is sufficient for capturing the 
dependence within that mode. 
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Suppose a K-w&y array Y is normally distributed, and define Y to be the array obtained when 
Y is standardized by its covariance matrix £ = Cov[vec(y)]: vec(y) — vec(y)£~ 1//2 . The elements 
of Y represent independent standard normal random variables. Thus, to determine whether the 
covariance in mode i is captured by a proposed (k±, kx) SFA model, we can compute Y using 
the SFA mode covariance matrix estimates as in ([I]) and test whether the covariance matrix of the 
rows of Yu\ equals the identity. The likelihood ratio test statistic for this test is 

in .~ 

[tr(V0-log|F|- mi], (16) 



where V = ^-Y^Y^, and has an asymptotic xLfm +11/5 distribution under the null hypothesis 



of an identity row covariance matrix. Note that rejecting this test suggests that a more complex 
covariance structure is needed for the mode. We propose the following rank selection procedure 
based on these mode specific tests. 

0. Consider an SFA model with all ki = 0. Obtain estimates of the covariance parameters using 



the maximum likelihood procedure in Section 3.1 and compute Y using the estimates. 



1. For each mode i, define Ri = Cov[vec(Y(j))] and test Hq : Ri = l m / mi <8> I mj vs H\ : R4 = 
^-m/mi ® V , where V is an unstructured rm x m% covariance matrix, using a likelihood ratio 



test with test statistic given by (16). 

{5(rrii, ki + 1) > 0, increase the rank ki by one. 
5(rrii, ki + 1) < 0, set the rank equal to mi. 
If the test for mode i does not reject, fix k{ at its current value and perform no further tests 
on the mode. Obtain maximum likelihood estimates {S, A,D} for an SFA model with the 
new ranks (k\, kx) and compute Y using these new estimates. 

3. Repeat steps 1-2 until each mode has failed to reject a test. 

The suggested ranks (ki,..,kx) are those that result at the end of this procedure. Recall that 
5{m,k) = [(m — k) 2 — (m + k)\ /2 represents the reduction in the number of parameters when 
using a fc-factor analytic covariance matrix instead of an m x m unstructured covariance matrix. 
The rank increases in step 2 reflect that an unstructured covariance matrix is specified when a 
factor analytic covariance structure no longer provides a reduction in the number of covariance 
parameters. 
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The maximum number of SFA models that could be considered using this procedure is bounded 
by largest value of ki such that 5(mi,ki) > 0, where I denotes the array mode with the largest 
dimension mi . To control the type I error rate of all mode tests to be a for an iteration of steps 1 
and 2, the level of each mode test can be set to a r where r is the number of modes being tested 
(i.e. the number that have rejected every test thus far). An example of this procedure is described 



in Section 4.2 for the mortality data. 



4 Application to Human Mortality Database death rates 

In this section we analyze death rates from the Human Mortality Database (HMD) (University 
of California, Berkeley and Max Planck Institute for Demographic Research, 2009) using an SFA 
model, compare our model to other covariance models, and obtain predictions for over four hundred 
missing death rates. We focus on death rates for 5-year time periods for populations corresponding 
to combinations of sex, age, and country of residence. Specifically, we consider death rates from 1960 
to 2005 for 40 countries, both sexes and twenty-three age groups, {0, 1 — 4, 5 — 9, 10 — 14, 105+}. 
These data are represented in a 4-way array Y = {y c tsa} of dimension (40 x 9 x 2 x 23), where y c tsa 
is the log death rate for country c, time period t, sex s and age group a. We will refer to a set of 
age-specific death rates for a combination of country, time period, and sex as a mortality curve. 

We begin this section by introducing a flexible piecewise polynomial mean model and show the 
residuals from this mean model exhibit dependence within each mode: age, time period, country, 



and sex. Using the likelihood ratio testing procedure presented in Section 3.3 we select ranks for 
an SFA model. The resulting SFA model is compared to models with simpler covariance structures 
using out-of-sample cross validation and is used to impute multiple years of missing death rates for 
Chile and Taiwan. 



4.1 Mean model selection 

As discussed in the Introduction, existing methods for analyzing mortality data model the death 
rates for different countries, sexes, and/or time periods separately. Such an approach can be 
inefficient due to the strong similarities between mortality rates within the same country, time 
period, or sex. For this reason, we propose a new joint mean model for the HMD data that 
acknowledges the relationships between mortality rates that share levels of one or more of these 
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factors. 

Figure [l] shows mortality curves denned by the twenty-three age-specific death rates for the 
United States and Sweden in four time periods. The large spikes at age zero represent infant 
mortality, and the humps around age twenty, which are especially evident in males, are attributed 
to teenage and young adult accident mortality. The overall shapes of the mortality curves for each 
sex are similar across countries and time periods, however Sweden has considerably lower mortality 
levels during childhood and young adulthood compared to the United States. This suggests that 
a mean model for the data should allow for different curves across countries and time periods, yet 
still take advantage of the similarity between death rates within the same country, age group, or 
sex. 

Drawing from the mortality literature and viewing mortality rates as function of age, we propose 
the following piecewise polynomial (PP) mean model: 



E[y. 



cysa\ 



+ acP 11 + a 2 <p 12 
+ ad 21 + a 2 d> 22 + a 



: a = 
: 1 < a < 20 
3^23 . 20 < a 



4 + PI + il 



(17) 



This model distinguishes between the infant, childhood, and adult stages of mortality by fitting each 
with a separate polynomial, whose coefficients are composed of additive effects for country, time 
period, and sex. The constant term at age zero is necessary to model the steep decline from infant 
mortality to child mortality that is not well represented by a low degree polynomial. Parameter 
estimates for this model based on the data array can be obtained by minimizing the ordinary the 
least squares (OLS) criterion YlcTlt [Vctga — ^[Uctga]] 2 > and since the model is linear in its 

parameters, the OLS estimates can be solved for algebraically. 

One of the most commonly used models in demography for age-specific mortality measures 



is the Heligman-Pollard (HP) model (Heligman and Pollard (1980)). This model also uses eight 
parameters to parameterize a mortality curve, however it is typically used to model each mortality 
curve individually and is nonlinear and non-convex in the parameters making estimation extremely 
difficult (Hartmann (1987), Congdon (1993)). When the HP model is fit separately to the 684 
HMD mortality curves for the 38 countries missing no death rates using OLS, it requires over 
5,400 parameters and under the assumption of independent, homoscedastic errors has a Bayesian 
Information Criterion (BIC) value of —17,288. However, when the PP model is fit jointly to the 
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same data, it contains 376 parameters and has a BIC of —52, 436. Due to the relative parsimony of 
the PP model, its superior fit in terms of BIC, and its straightforward estimation, it was selected 
as the mean model. 



4.2 Excess dependence and SFA rank selection 



The piecewise polynomial model in (17) is extremely flexible. To investigate its fit to the HMD 
mortality rates, we focused on a subset of the original data that contains no missing observations, 
specifically the (38 x 9 x 2 x 23) array that does not have death rates for Chile or Taiwan. The 
OLS fit this data explains 99.5% of the variation in mortality rates (coefficient of determination, 
R 2 = 0.995). However, there is interest in whether excess correlation exists in the residuals since 
modeling it can improve both predictions of missing values and the efficiency of parameter esti- 



mates. Ordinary least squares estimates of the parameters in (17) are equivalent to maximum 
likelihood estimates assuming independent normal errors. To evaluate this latter assumption, we 
computed the empirical correlations between the mean model residuals for countries, time periods, 
and age groups by matricizing the residual array with respect to each mode and computing a sample 
correlation matrix for the mode. 

As mentioned in the Introduction, the distributions of these correlations have substantially 
more large positive values than would be expected under the assumption of independent errors. 
For example, speaking specifically to the temporal dependence, the average correlation between 
adjacent time periods, those one time period apart, and those two periods apart is 0.79, 0.54, and 
0.26, respectively. The first two principal components of each correlation matrix are shown in Figure 
[2] The horseshoe pattern in the time period principal components and the clustering of countries 
within the same region suggests temporal and geographic trends in the data are not captured by 



the mean (Diaconis et al. (2008)). This indicates that even though the mean model contains several 
country specific and time period specific parameters, similarities between the mortality curves of 
certain countries and time periods is not being accounted for. The mean model already contains 
over 370 parameters and it would likely be nontrivial to modify it to capture all of the dependence 
seen in the residuals. For this reason, we consider incorporating a covariance structure to model 
this excess dependence. An array normal separable covariance structure could be specified, however 
it would add over one thousand parameters to the model. Therefore, we instead consider an SFA 
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model for the data with the PP mean with the belief that some of the residual mode covariance 
matrices may be well approximated by a low rank factor analytic structure. 



As outlined in Section 3.3, suggestions for the SFA ranks can be obtained from a repeated 
likelihood ratio testing procedure. For the mortality data, we consider (k c , k t , k s , k a ) SFA mod- 
els where the ranks correspond to the country, time period, sex, and age covariance matrices, 
respectively. The standardized residual array Y for a (k c , kt, k s , k a ) SFA model is defined as 

- — ■ - — - -~- 1/2 1/2 1/2 1/2 ' — - 

vec(Y) = (vec(y) - vec(M))(S a v " ® S s x/i ® E t 1 ® S c where M represents the PP mean 
model maximum likelihood estimate and Sj is the SFA mode i covariance matrix estimate. The 
results from the iterative testing procedure are shown in Table [TJ The first step in this process is 
to consider a (0,0,0,0) SFA model where all covariance matrices are diagonal. The likelihood ratio 
test statistics for this model are shown in the first row of Table [T] and the corresponding 0.05 level 
critical values are shown the last row. Since the test for each mode rejects the null hypothesis 
of independent, variance one errors, the rank of each mode is increased by one in the subsequent 
model, except for that for the sex mode. A rank one factor analytic structure for a (2 x 2) covari- 
ance matrix has more parameters than an unstructured covariance matrix so the sex covariance 
matrix is unstructured in the next model. A box around a test statistic in the table indicates the 
mode failed to reject the test for the first time. Recall that when a mode's test does not reject, 
the rank for that mode is fixed and not increased in later models. The table shows where the sex, 
time period, country, and age ranks become fixed at two, four, nine, and ten, respectively. Observe 
that after a mode's rank is fixed, the test statistic for that mode stays below the critical value in 
all subsequent models. Although the mode tests are not independent of the covariance structures 
fit in the other modes, this consistency supports the suggested ranks. 

4.3 Out-of-sample cross validation 

We evaluate the SFA model by comparing its out-of-sample predictive performance with two simpler 
covariance models that share the same PP mean model. The three covariance models considered 
are the following: 

Ml: Independent and identically distributed (i.i.d.) model 

M2: Time covariance model (0,9,0,0) 

M3: SFA model (9,4,2,10) 



23 



Table 1: Iterative testing procedure for the SFA ranks. Each row represents an SFA model and 



each entry is the likelihood ratio test statistic based on (16). The 0.05 level critical value for each 



test is given in the last row. A box around a statistic indicates that the mode does not reject the 
test for the first time and the rank is fixed in subsequent models. 



SFA ranks 



Likelihood ratio test statistic 
Country Time period Sex Age 



(0,0,0,0) 

(1,1,2,1) 
(2,2,2,2) 

(3,3,2,3) 
(4,4,2,4) 
(5,4,2,5) 
(6,4,2,6) 
(7,4,2,7) 
(8,4,2,8) 
(9,4,2,9) 
(9,4,2,10) 



21,852 
9,526 
4,425 
2,776 
1,946 
1,556 
1,287 
1,040 
892 
762~ 
737 



14,482 
5,853 
1,722 
716 



17 



14 
10 



702 













27,883 
14,451 
6,374 
3,762 
2,422 
1,833 
1,340 
967 
540 
363 
257 



~ critical value 



X.95 



742 



45 



276 



Ml corresponds to the conventional ordinary least squares (OLS) approach where all errors are 
assumed independent and identically distributed with a common variance parameter. Based on the 
temporal nature of the data, a natural first step to incorporating a covariance model is to consider 
an unstructured covariance matrix for time as in M2. In general, country mortality rates are 
relatively stable over time so if the observed mortality for a given country, year, and age deviates 
from the mean model in one year, it is likely the observations deviate in the same direction in 
neighboring years. 

Fifty cross validations were performed by removing a random 25% of the array, estimating each 
of the three models on the remaining data, and computing the mean squared error (MSE) between 
the observed values and the predicted values for the withheld entries. The predicted values for Ml 
are those from the OLS PP mean estimate. For M2 and M3, the predictions are the posterior mean 



estimates of the missing values from the Bayesian estimation procedure described in Section 3.2 
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A prior distribution for the parameters in the PP model is needed to perform simultaneous 
Bayesian estimation for the mean and covariance parameters. The prior on the vector of PP 
coefficients is a mean zero normal distribution with covariance matrix m(X T ' X)~ l , where X is the 



design matrix for the PP model for vec(Y) and m = IJi-Li m «- This is a relatively uninformative 



prior as it is over 30 times more diffuse than the corresponding unit-information prior (Kass and 



Wasserman (1995)). The hyperparameters were specified as described in Section 3.2 where the 
mean estimate M used in ijj is the OLS estimate of the PP model. Since M2 has no modes with 
factor analytic structure, the prior on the time covariance matrix is 



S t 1 ~ 



Wishart { nt = mt + 2 



m t 



mt 



This specification is necessary to preserve the property that E[tr(Cov[vec(Y)])] = ip under the prior. 

The results from the 50 cross-validations are shown in Tabled The MSE for the SFA model was 
less than that of the time covariance model for each of the 50 cross-validation runs, and the MSE 
for the time covariance model was always less than that of the i.i.d. model. In terms of average 
MSE, both the time covariance model and the SFA model significantly improve upon the i.i.d. 
model, and the SFA model out performs the time covariance model by nearly a factor of two. This 
is evidence that even with the extremely flexible PP mean model, the SFA covariance structure still 
improves model fit as it is able to estimate the similarity between mortality rates across countries, 
time periods, age groups, and sexes, and use this information in its predictions. 

Table 2: Average and standard deviation of the mean squared errors from 50 out-of-sample cross- 
validation experiments. 





Ml (i.i.d.) 


M2 (Time Covariance) 


M3 (SFA) 


Average mean squared error 


0.02996 


0.00729 


0.00385 


Standard deviation of mean squared errors 


0.00084 


0.00049 


0.00034 



4.4 Prediction of missing data 

The imputation of missing death rates is an important application of modeling mortality data as 
information is often incomplete for countries lacking accurate death registration data. We now 
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consider the original (40 x 9 x 2 x 23) array of mortality rates with observations for Chile and 
Taiwan. Seven time periods of mortality information are missing for Chile (1960-1995) and two 
time periods for Taiwan (1960-1970), combining for a total of 414 missing entries in the array. The 



maximum likelihood estimation algorithm in Section 3.1 cannot accommodate missing data so we 
are unable to reselect the SFA ranks using the testing procedure. However, this larger data array 
contains only two additional countries so the SFA ranks (9, 4, 2, 10) selected for the reduced data 
are used here. Predictions for the missing death rates were based on samples from the Metropolis- 
Hastings procedure, for which the effective sample sizes for the Monte Carlo estimates of all missing 
values was greater than 500. 

In the left column of Figure [3| posterior mean predicted death rates and 95% prediction intervals 
are shown for Chile in 1990 and Taiwan in 1965. To visualize the impact of the SFA covariance 
model on the predicted death rates, we investigate the difference between the SFA predicted values 
and the fitted values based on the PP mean model. The SFA predictions, y p , are conditional 
on the observed mortality rates for all other countries and time periods, while the mean model 
fitted values, y m , are based only on the estimate of the PP model. We call these differences, 
Vp — ym-, "predictive residuals" since they are based on predicted values instead of observed values. 
These differences illustrates the changes in the predicted values by using the estimated dependence 
between residuals within modes of the array and conditioning on the observed mortality rates. The 
empirical residuals based on the PP mean model, y — y m , were computed for the United States and 
Australia, the two countries most highly correlated with Chile (estimated correlations of around 
0.40). These residuals were also computed for Japan and West Germany, the two countries most 
highly correlated with Taiwan (estimated correlations of around 0.13). The middle column of 
Figure [3] shows the predictive residuals for Chile and Taiwan and the empirical residuals for these 
select countries. The last column contains the empirical residuals in 1995 and 1970 when mortality 
information is available for all countries. Observe that the plots in the middle column and last 
column are similar, demonstrating an overall positive association for both sexes and all country 
pairs. This illustrates how the model uses the relationship between the empirical residuals of Chile 
and other countries to predict Chile's deviations from the mean model in years when Chile data is 
missing. The ability to draw information across multiple country, year, and sex residuals to impute 
missing values is a critical strength of the SFA model that is not shared by other mortality models 
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or simpler covariance structures. 

The empirical residuals for Chile shown in the last column may not show as strong of an 
association with the United States and Australia as one would expect from a posterior mean 
correlation estimate of 0.4. However, recall that the estimate of the country correlations is based 
on all time periods, sexes, and ages. Although we show adjacent time periods in this plot, the 
correlation between the country residuals in the period adjacent to the missing time period and 
the correlations in time periods furthest away are weighted equally in the estimate of the country 
correlation, and hence weighted equally in the imputation of the missing data. For example, the 
correlation between Taiwan and Japan's empirical residuals in 2000 and that in 1970 influence 
Taiwan's imputations in 1965 equally. This property is a consequence of the separability of the 
SFA covariance matrix. A more complicated non-separable covariance model would be required for 
the correlations between countries, ages, and sexes to be differentially weighted in the imputation 
based on the proximity of the observed data to the missing data. 

5 Discussion 

In this article we introduced the separable factor analysis model for array-valued data. Unlike 
the array normal model where all mode covariance matrices are unstructured, SFA parameterizes 
mode covariance matrices by those with factor analytic structure. Using covariance matrices with 
reduced structure decreases the number of parameters in the model considerably and allows mode 
covariance matrices to be estimated using maximum likelihood methods for any array dimension. 
Including a covariance structure in a model for multiway data can drastically improve mean model 
parameter estimation and missing data predictions in situations where dependence exists within 
modes that is not captured by the mean model. In an out-of-sample cross validation study with a 
large set of mortality data, the SFA model was shown to have superior fit compared to models with 
simpler covariance structures, even in the presence of an extremely flexible mean model. The SFA 
model was also shown to estimate which countries have similar deviations from the mean model 
and was able to use this information to predict multiple years of missing death rates. 

An alternative extension of factor analysis to arrays was considered that resembles the higher 
order singular value decomposition (see Appendix [Bj . This model has a non-separable covariance 
structure and can be viewed a submodel of the single mode factor analysis model for vec(Y). We 
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Figure 3: The first column of plots shows the predicted values and corresponding 95% prediction 
intervals for the missing death rates for Chile and Taiwan. The middle column shows the difference 
between the posterior mean predicted value and the piecewise polynomial mean function fitted 
value, y p — y m , for Chile and Taiwan, along with empirical mean model residuals, y — y m , for 
countries that are highly correlated with them in the posterior mean country covariance matrix. 
The last column contains empirical residuals for the following time period when Chile and Taiwan 
mortality is observed. 

chose the SFA model over this alternative extension due to the interpretability of its parameters as 
single mode factor model parameters and for its use as an approximation to a separable covariance 
model with an unstructured covariance matrix in each mode. 
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A Appendix: Sampling A and D from their full conditional dis- 
tributions 



Let A* be the proposed value of A that results from A(i-ii). The acceptance probability for this 
proposal is 

(A* A) = MA?|y, A-j, D, T,)p(Aj\A* , D , E, A„,, Y) = p(y|A*, A_,, D, T J )p(A*\D i )p(A i \A*,D, E, A_,, Y) 
a{ *' l) p(Aj|Y, A_j, D, E)p(A?|A i; D, E, A_ i; y) p(F|A i; A_ is £>, S)p(A i | A)p(A*|Ai, A E, A_ 4 , y) 

The proposal probability can be written 

p(A*\Ai,D,E,A-i,Y) = J p(A*,Z i \A i ,D,E,A- i ,Y)dZ i 

= j p(A*\Z i ,D,E,A^ i ,Y)p(Z i \A i ,D,^,A_ i ,Y)dZ i 

= P(K\D^Y) J P ^^^ ^,^A^W 
_ p(Y\A*,D, E, A_ M A h D, E, A^ _ ^ ^ 



P (As,A_i,y) 

p(y|At, J D,E,A_ i )p(A*|A)p(£')p(S)p(A_ i | J D_i 



■c(Ai,A*|D, E,A_i,y) 



where c(Aj, A||D, E, A_j, y) represents the integral, which is symmetric in Aj and A*. Plugging 
the last expression into the acceptance probability, we obtain a(A*,Aj) = 1. Analogous logic can 
be used to show the acceptance probability for a proposed Di from B(i-ii) is also one. 



B Appendix: Alternative extension of factor analysis to arrays 

An alternative extension of factor analysis to arrays is motivated by the latent variable represen- 
tation of single model factor analysis in ^ . First consider extending the single mode factor model 
to estimate relationships among the rows and the columns of a matrix X, by writing 

X pxn = A\ZA% + D\ED\ = Z klXhz x {A x , A 2 } + E pxn x {D 1; D 2 }, (18) 

where W x {Ai, Ak} indicates the first mode of the if -way array W is left multiplied by A%, 
the second mode is left multiplied by A 2 , etc. As in SFA, Aj is {m\ x ki), and Di is diagonal, 
however we only consider ki > 0. If Z and E have the same distributional properties as in @, the 
covariance matrix of X is Cov[vec(X)] = (A 2 Aj ® AiAf ) + {D\ ® D\). 



29 



The analogous model for a K-w&y array Y has the following equivalent representations: 



Y = Z x {Ai, A K } + E x {Z>i, .., D K } (19) 

Cov[vec(F)] = (A A 'A^ ® ... ® AiAf) + ® ... <8> £>?) 

where Z is (fci x ... x kg), E is (mi x ... x mjf), and these again have the same distributional 
properties as in Q. The second moment of a matricization of the array is written 

E[y (i) Yg] = a.A^Af + 7i I>? ^ = JJ tr(A j Aj) 7f = JJ tr(D 2 ). 

Observe that the second moment has fcj-factor analytic structure and will be unstructured if Aj = 
Di = S^ 2 , where £*^ 2 is any non-singular mi x m» matrix. 



The model in (19) has many similarities to the higher-order singular value decomposition 



(HOSVD) ( |Tucker| ( [1966D , |De Lathauwer et al.| ( |2000| ». The HOSVD states that any A-way 



array Y can be written Y = G X {Ui, Uk}, where G is an all-orthogonal core matrix of the same 
dimension as Y and Ui is (mj x mi) satisfying UfUi = I for i G {1, K}. The slice of Y in 
the mode is that which results by setting the j th index of Y equal to i. An array is considered 
to be of reduced rank if slices of the core array G are zero. A AT- way array of rank (ri, r^) can 
be expressed by the HOSVD with a core matrix G of dimension (n x ... x r^) where each U is of 



dimension (m; x ri). The alternative array factor model in (19) can be written as Y = M + E where 
M = Z x {...} is the mean and common factor portion, and E = E x {...} is the error component. 
The mean structure of this model resembles the HOSVD if Z is viewed as a core array. Although 
Z is not all-orthogonal and Aj and £j do not satisfy A^Aj = {t}J 2 ) t t}J 2 = J, M can be rewritten 
using the HOSVD to obtain a new Z, A, and £j that satisfy the constraints. The error component 
E is then interpreted as accounting for variation in Y which is not captured by the reduced rank 
approximation M . 



The covariance model in (19) is a submodel of the single mode (Y\ fcj)-factor model for vec(Y) 
since the covariance matrix is comprised of a reduced rank matrix plus a diagonal matrix. SFA 
is equivalent to this alternative extension if zero or one mode is specified with factor analytic 
structure. A drawback of the HOSVD and this alternative extension is that mode parameter values 
are difficult to interpret and cannot be considered independently of parameters in other modes. 
For this reason, we chose to focus on SFA as the primary extension of factor analysis to arrays. 
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